NATURAL LANGUAGE PROCESSING ASSIGNMENT 2022

AUTHORS: Carlos Ramos Mateos and Adrián Rubio Pintado

2022

In [1]:
import json
import re
import pandas as pd
import plotly
import plotly.express as px
import nltk
from nltk.parse.corenlp import CoreNLPDependencyParser
from nltk.corpus import wordnet as wn
from nltk.corpus import opinion_lexicon
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.sentiment.vader import VaderConstants
from nltk.corpus import sentiwordnet as swn

plotly.__version__
Out[1]:
'5.6.0'
In [2]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('opinion_lexicon')
nltk.download('vader_lexicon')
nltk.download('sentiwordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Out[2]:
True

Aux. Functions:

We implement a helper function that returns the list of review texts from a given dataset.

In [3]:
def getReviews(data):
    # Collect the 'reviewText' field of every review in the dataset
    reviewsT = []
    for review in data:
        reviewsT.append(review.get('reviewText'))
    return reviewsT

TASK 1

• Task 1.1 - mandatory: loading all the hotel reviews from the Yelp hotel reviews file. See Appendix A

We start from the example code provided in the assignment statement.

In [4]:
with open('yelp_dataset/yelp_hotels.json', encoding='utf-8') as f:
    reviews = json.load(f)

numReviews = len(reviews)
print(numReviews, 'reviews loaded')
5034 reviews loaded
In [5]:
print(reviews[0])
print(reviews[0].get('reviewerID'))
{'reviewerID': 'qLCpuCWCyPb4G2vN-WZz-Q', 'asin': '8ZwO9VuLDWJOXmtAdc7LXQ', 'summary': 'summary', 'reviewText': "Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!", 'overall': 4.0}
qLCpuCWCyPb4G2vN-WZz-Q

• Task 1.2 - optional [low difficulty]: loading line by line* the reviews from the Yelp beauty/spa resorts and restaurants reviews files

We write a function to read the datasets line by line: we open the corresponding file, iterate over it one line at a time, and parse each line after stripping the trailing comma and newline.

In [6]:
def loadReview_byLine(dir_):
    data = []

    with open(dir_, encoding='utf-8') as file:
        for line in file:
            # Skip the opening '[' and closing ']' of the JSON array
            if line != '[\n' and line != ']':
                line = line.rstrip(',\n')
                data.append(json.loads(line))

    return data
In [7]:
data_hotel = loadReview_byLine('yelp_dataset/yelp_hotels.json')
print(data_hotel[0])
{'reviewerID': 'qLCpuCWCyPb4G2vN-WZz-Q', 'asin': '8ZwO9VuLDWJOXmtAdc7LXQ', 'summary': 'summary', 'reviewText': "Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!", 'overall': 4.0}

We load the datasets line by line:

In [8]:
data_spas = loadReview_byLine('yelp_dataset/yelp_beauty_spas.json')
print(data_spas[0])
{'reviewerID': 'Xm8HXE1JHqscXe5BKf0GFQ', 'asin': 'WGNIYMeXPyoWav1APUq7jA', 'summary': 'summary', 'reviewText': "Good tattoo shop. Clean space, multiple artists to choose from and books of their work are available for you to look though and decide who's style most mirrors what you're looking for. I chose Jet to do a cover-up for me and he worked with me on the design and our ideas and communication flowed very well. He's a very personable guy, is friendly and keeps the conversation going while he's working on you, and he doesn't dick around (read: He starts to work and continues until the job is done). He's very professional and informative. Good customer service combines with talent at the craft.", 'overall': 4.0}
In [9]:
data_restaurants = loadReview_byLine('yelp_dataset/yelp_restaurants.json')
print(data_restaurants[0])
{'reviewerID': 'rLtl8ZkDX5vH5nAx9C3q5Q', 'asin': '9yKzy9PApeiPPOUJEtnvkg', 'summary': 'summary', 'reviewText': 'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.Anyway, I can\'t wait to go back!', 'overall': 5.0}

TASK 2

Task 2.1 - mandatory: loading (and printing on screen) the vocabulary of the aspects_hotels.csv file, and directly using it to identify aspect references in the reviews. In particular, the aspects terms could be mapped by exact matching with nouns appearing in the reviews. See Appendix B

We define a function to load the aspects from a text file and store them in a dictionary. The returned vocabulary maps each aspect to its set of terms: {"aspect": {"term1", ..., "termN"}}

In [10]:
def dic_aspect(file):
    aspects = {}
    with open(file, "r", encoding="utf-8") as f:
        for l in f:
            tokens = l.rstrip('\n').split(',')
            if tokens[0] not in aspects:
                aspects[tokens[0]] = set()
            aspects[tokens[0]].add(tokens[1])
    return aspects

We load the vocabularies from the provided aspect files for the hotels domain, as well as for the books, CDs, digital music, etc. domains.

In [11]:
aspects_hotel = dic_aspect("aspects/aspects_hotels.csv")
len(aspects_hotel)
Out[11]:
31

We obtain the 31 aspects, which are:

In [12]:
pd.DataFrame(aspects_hotel.keys(),  columns=['Aspects'])
Out[12]:
Aspects
0 amenities
1 atmosphere
2 bar
3 bathrooms
4 bedrooms
5 booking
6 breakfast
7 building
8 checking
9 cleanliness
10 coffee
11 cuisine
12 dinner
13 drinks
14 events
15 facilities
16 gym
17 internet
18 location
19 lunch
20 parking
21 pool
22 price
23 restaurant
24 restrooms
25 service
26 shopping
27 spa
28 staff
29 temperature
30 transportation

We print the full aspect-term vocabulary for the hotel domain using a helper function:

In [13]:
def print_aspects(aspects):
    data_tuples = list(zip(aspects.keys(),aspects.values()))
    return pd.DataFrame(data_tuples, columns=['Aspect','Term'])
In [14]:
print_aspects(aspects_hotel)
Out[14]:
Aspect Term
0 amenities {services, amenity, amenities}
1 atmosphere {light, lighting, ambiances, atmospheres, ambi...
2 bar {bartenders, bar, bartender, bars}
3 bathrooms {shampoos, bathtub, showers, towel, baths, bat...
4 bedrooms {sheets, bedroom, beds, sheet, suite, suites, ...
5 booking {book, reservation, booking, reservations, res...
6 breakfast {breakfast, toast, mornings, toasts, morning, ...
7 building {building, patios, architecture, furniture, de...
8 checking {checks, check out, registration, checkin, che...
9 cleanliness {clean, cleaning, smell, cleaned, dirt, dirty,...
10 coffee {cafe, tea, cafes, coffees, teas, coffee}
11 cuisine {meals, plate, dishe, buffets, food, cuisines,...
12 dinner {night meal, evening meal, evening menu, night...
13 drinks {wines, wine, beer, drink, drinks, beers}
14 events {activities, trips, event, trip, activity, par...
15 facilities {facility, equipment, facilities}
16 gym {gym, gyms}
17 internet {wireless, wi-fi, wi fi, wifi, internet}
18 location {beach, neighbourhoods, tree, street, lakes, r...
19 lunch {afternoon menu, noon meal, afternoon meal, lu...
20 parking {parking, parkings}
21 pool {swimming pools, pools, swimmingpool, swimming...
22 price {price, pricing, fee, priced, prices, fees, mo...
23 restaurant {restaurants, restaurant}
24 restrooms {toilets, restrooms, toilet, restroom}
25 service {serving, servers, attention, servings, attitu...
26 shopping {store, shops, boutique, shop, mall, stores, s...
27 spa {jacuzzis, jacuzzi, spa, spas, saunas, sauna}
28 staff {owners, bellhop, workers, receptionists, empl...
29 temperature {temperatures, temperature}
30 transportation {bus, cap, train, tubes, metros, taxi, caps, s...

Given a review, we identify aspects by exact matching of the nouns appearing in it: if one of the review's words appears in the term list of some aspect of our vocabulary, we consider that the review mentions that aspect. We first define the function:

In [15]:
def get_review_apects(review, vocabulary):
    '''
        Args:
            review: text
            vocabulary: vocabulary of aspects
        Returns:
            pd DataFrame with the word-aspect pairs found in the review
    '''
    aspects = []
    terminos = []

    # Strip punctuation marks before splitting into words
    text = re.sub(r'[?.!/;:,]', ' ', review)
    unique_words = set(text.lower().split())
    for a, terms in vocabulary.items():
        # set membership makes each lookup fast
        found_words = [w for w in unique_words if w in terms]
        for w in found_words:
            aspects.append(a)
            terminos.append(w)
    
    rev_aspects = pd.DataFrame( {'Aspect': aspects, 'Term': terminos})

    return rev_aspects

We take the first review of the dataset and print the aspects found, together with the term that led us to identify each aspect in the review (exact matching).

In [16]:
data_hotel[0].get('reviewText')
Out[16]:
"Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!"
In [17]:
aspectsH = get_review_apects(data_hotel[0].get('reviewText'), aspects_hotel)
aspectsH
Out[17]:
Aspect Term
0 bar bar
1 building lobby
2 building patio
3 pool pool
4 shopping boutique
5 transportation car

Task 2.2 - optional [low difficulty]: generating or extending the lists of terms of each aspect with synonyms extracted from WordNet. See Appendix E

Since a term may have more than one synset, one per sense of the term, we take only the first sense, as it is the most common one. Experimentally, using all senses introduces, on average, many terms with no real connection to the opinions, adding 'garbage' to the model.

For example, for "amenity" it introduces "sweetness", most likely through a hypernym/hyponym of one of the senses other than the first.

Even so, we decided to keep only the first sense, given that our hotel dataset is general-purpose: since the context does not call for technical language, reviews will most likely use everyday words rather than the more technical or highly literary senses.

To extend the vocabulary, for each aspect we extend its term list as follows:

1. Extend with synonyms
2. Extend with hyponyms (as they are more concrete realizations)
3. Extend with hypernyms (since a word more general than the aspect itself may exist)


On average, we expect the term extension to be more beneficial than harmful for our aspects.

As future work, one way to improve this function would be to pass the sentence context as an additional parameter, instead of looking up aspects "statically". We could then use the context in which the word appears to pick the most appropriate sense, and hence the most appropriate aspect, for the sentence at hand.
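The context-based idea above can be sketched with a simplified Lesk-style overlap, without touching WordNet itself. The sense inventory below is hypothetical, purely for illustration:

```python
# Hypothetical mini sense inventory for the ambiguous word "bar";
# each sense is represented by a set of signature words.
SENSES = {
    "bar (drinks)": {"counter", "cocktails", "beer", "bartender", "drinks"},
    "bar (law)":    {"lawyers", "court", "exam", "legal"},
}

def pick_sense(context_words, senses):
    # Choose the sense whose signature overlaps most with the context
    return max(senses, key=lambda s: len(senses[s] & context_words))

context = {"the", "hotel", "bar", "served", "great", "cocktails", "all", "night"}
pick_sense(context, SENSES)  # -> 'bar (drinks)'
```

With real data, the signature sets would come from synset glosses and examples (this is essentially what `nltk.wsd.lesk` does).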

In [18]:
import copy

def extend_vocabulary_wordnet(vocabulary):
    '''
        Args:
            Aspects vocabulary.

        Returns:
            Extended vocabulary
    '''
    extended_vocabulary = copy.deepcopy(vocabulary)

    for aspect in list(vocabulary.keys()):
        synsets = wn.synsets(aspect)

        if synsets:
            synset = synsets[0]  # first (most common) sense

            extended_vocabulary[aspect].update(synset.lemma_names())  # synonyms
            for h in synset.hypernyms():  # hypernyms (more general)
                extended_vocabulary[aspect].update(h.lemma_names())
            for h in synset.hyponyms():  # hyponyms (more specific)
                extended_vocabulary[aspect].update(h.lemma_names())

    return extended_vocabulary
In [19]:
aspects_hotel_extended_wordnet = extend_vocabulary_wordnet(aspects_hotel)
print_aspects(aspects_hotel_extended_wordnet)
Out[19]:
Aspect Term
0 amenities {livelihood, creature_comforts, living, amenit...
1 atmosphere {flavor, light, miasm, lighting, status, music...
2 bar {bar, barrelhouse, saloon, cocktail_lounge, ta...
3 bathrooms {bathrooms, shampoos, bathtub, showers, towel,...
4 bedrooms {chamber, dorm_room, boudoir, guestroom, sleep...
5 booking {gig, work, employment, book, reservation, boo...
6 breakfast {breakfast, toast, mornings, toasts, morning, ...
7 building {building, butchery, presbytery, house_of_God,...
8 checking {suss_out, canvass, check-outs, check, check_u...
9 cleanliness {clean, cleaning, smell, cleaned, dirt, dirty,...
10 coffee {cafe, Turkish_coffee, java, potable, demitass...
11 cuisine {rechauffe, buffet, menu, plates, dim_sum, mea...
12 dinner {night meal, evening meal, high_tea, evening m...
13 drinks {hair_of_the_dog, wines, wine, tipple, whisky_...
14 events {activity, party, social_event, deed, act, eve...
15 facilities {power_system, assembly, forum, zoological_gar...
16 gym {gymnasium, gym, gyms, athletic_facility}
17 internet {wireless, computer_network, wi-fi, wi fi, cyb...
18 location {beach, neighbourhoods, southwest, tree, south...
19 lunch {afternoon menu, noon meal, meal, luncheon, ti...
20 parking {way, parkings, parking, room, elbow_room}
21 pool {wading_pool, cistern, swimming pools, pools, ...
22 price {marginal_cost, differential_cost, priced, mon...
23 restaurant {cafe, building, eating_place, diner, tea_parl...
24 restrooms {lav, can, privy, toilet, bathroom, public_lav...
25 service {utility, work, national_service, serving, ser...
26 shopping {store, shops, shop, buying, purchasing, bouti...
27 spa {jacuzzis, jacuzzi, watering_hole, spa, vacati...
28 staff {owners, office, research_staff, maintenance_s...
29 temperature {boiling_point, fundamental_quantity, high_tem...
30 transportation {bus, cap, train, transportation_system, tubes...

We observe that we gain many terms that will be useful later on.

One issue is that some of the new terms contain hyphens or underscores, as if they were single words. However, since some users may well write them that way and they do not hurt the model, we leave them as-is for the exact-matching approach to aspect identification.
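If we did want those multi-word lemmas to match plain review text, a small normalization step could be applied when extending the vocabulary; a sketch (note that matching multi-word terms would also require comparing n-grams, not just single tokens):

```python
import re

def normalize_lemma(lemma):
    # WordNet multi-word lemmas use underscores (e.g. 'swimming_pool');
    # replacing separators with spaces lets them line up with review text.
    return re.sub(r"[_-]", " ", lemma.lower())

normalize_lemma("swimming_pool")  # -> 'swimming pool'
```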

Let us see what we obtain for review 0 of the hotels dataset from exercise 2.1 with our extended vocabulary:

In [20]:
aspectsH_2 = get_review_apects(data_hotel[0].get('reviewText'), aspects_hotel_extended_wordnet)
aspectsH_2
Out[20]:
Aspect Term
0 bar bar
1 building hotel
2 building lobby
3 building patio
4 events happening
5 location here
6 pool pool
7 shopping boutique
8 transportation car

Note that the term hotel was not identified before, and we also gain a new aspect in the opinion, events, since the word happening suggests it could refer to an event.

Task 2.3 - optional [low difficulty]: managing vocabularies for additional Yelp or Amazon domains. See assignments 1.2 and 1.3

As in section 2.1, we load the aspect vocabulary for Yelp domains other than hotels:

For example, the vocabulary of the books domain, from the file 'aspects/aspects_books.csv', has 16 aspects:

In [21]:
def count_aspects(aspects):
    # Total number of terms across all aspects
    return sum(len(terms) for terms in aspects.values())

We print its aspect vocabulary and the total number of terms associated with these aspects.

In [22]:
aspects_books = dic_aspect("aspects/aspects_books.csv")
print("count_aspects: ", len(aspects_books) ," count_terms: " , count_aspects(aspects_books))
count_aspects:  16  count_terms:  147
In [23]:
print_aspects(aspects_books)
Out[23]:
Aspect Term
0 atmosphere {ambiences, light, lighting, ambiances, ambien...
1 characters {role, protagonists, roles, villains, enemies,...
2 coherence {coherences, consistencies, consistency, coher...
3 descriptions {visual-descriptions, descriptions, visual des...
4 ending {finale, closure, conclusion, completion, endi...
5 language {language, tone, words, languages, tones}
6 literary_style {narrative, narrative styles, literary styles,...
7 pacing {rhythmic score, cadences, rhythm, rhythms, ca...
8 pictures {chart, pictures, figure, charts, figures, pic...
9 price {price, pricing, fee, prices, fees, costs, cos...
10 scenes {scenes, sequence, scene, sequences}
11 script {dialogs, dialog, conversation, screenplay, sc...
12 start {introductions, presentations, presentation, p...
13 story {storylines, screen play, story-lines, story-l...
14 theme {matter, topics, topic, themes, matters, messa...
15 writer {writers, writer, authors, author}

Now we extend it with WordNet:

In [24]:
aspects_books2 = extend_vocabulary_wordnet(aspects_books)
print_aspects(aspects_books2)
Out[24]:
Aspect Term
0 atmosphere {flavor, light, miasm, lighting, status, glumn...
1 characters {role, protagonists, villains, roles, enemies,...
2 coherence {continuity, cohesion, coherences, connectedne...
3 descriptions {descriptions, visual description, visual desc...
4 ending {finale, suffix, closure, termination, inflect...
5 language {superstratum, lingua_franca, natural_language...
6 literary_style {narrative, style, narrative styles, literary ...
7 pacing {rhythmic score, musical_time, cadences, tempo...
8 pictures {computer_graphic, cyclorama, sonogram, graphi...
9 price {marginal_cost, price, inexpensiveness, pricin...
10 scenes {sequences, stage, dark, light, venue, country...
11 script {continuity, dialogue, dramatic_composition, s...
12 start {introductions, presentation, preamble, runnin...
13 story {storylines, screen play, sob_stuff, sob_story...
14 theme {matter, subject, head, topics, topic, bone_of...
15 writer {writer, wordsmith, reviewer, scriptwriter, wr...
In [25]:
print("count_aspects: " , (len(aspects_books2)) , " count_terms: " , count_aspects(aspects_books2))
diff = count_aspects(aspects_books2) - count_aspects(aspects_books)
print("\tdiff: " , diff)
count_aspects:  16  count_terms:  379
	diff:  232

We have considerably increased the number of terms, by exactly 232 in this example, going from the 147 original terms to the 379 of the extended vocabulary.

Likewise, we load the vocabularies of the remaining files in the aspects folder, although for the sake of notebook length we do not print them all, as they are similar to the books vocabulary. For each of them we can see the number of aspects:

In [26]:
aspects_cds = dic_aspect("aspects/aspects_cds.csv")
aspects_cds2 = extend_vocabulary_wordnet(aspects_cds)

print("Original\tcount_aspects: " , (len(aspects_cds)) , " count_terms: " , count_aspects(aspects_cds))
print("Extended\tcount_aspects: " , (len(aspects_cds2)) , " count_terms: " , count_aspects(aspects_cds2))
diff = count_aspects(aspects_cds2) - count_aspects(aspects_cds)
print("\tdiff: " , diff)
Original	count_aspects:  29  count_terms:  242
Extended	count_aspects:  29  count_terms:  669
	diff:  427
In [27]:
aspects_digitalmusic = dic_aspect("aspects/aspects_digitalmusic.csv")
aspects_digitalmusic2 = extend_vocabulary_wordnet(aspects_digitalmusic)

print("Original\tcount_aspects: " , (len(aspects_digitalmusic)) , " count_terms: " , count_aspects(aspects_digitalmusic))
print("Extended\tcount_aspects: " , (len(aspects_digitalmusic2)) , " count_terms: " , count_aspects(aspects_digitalmusic2))
diff = count_aspects(aspects_digitalmusic2) - count_aspects(aspects_digitalmusic)
print("\tdiff: " , diff)
Original	count_aspects:  37  count_terms:  356
Extended	count_aspects:  37  count_terms:  843
	diff:  487
In [28]:
aspects_movies = dic_aspect("aspects/aspects_movies.csv")
aspects_movies2 =  extend_vocabulary_wordnet(aspects_movies)

print("Original\tcount_aspects: " , (len(aspects_movies)) , " count_terms: " , count_aspects(aspects_movies))
print("Extended\tcount_aspects: " , (len(aspects_movies2)) , " count_terms: " , count_aspects(aspects_movies2))
diff = count_aspects(aspects_movies2) - count_aspects(aspects_movies)
print("\tdiff: " , diff)
Original	count_aspects:  23  count_terms:  252
Extended	count_aspects:  23  count_terms:  606
	diff:  354
In [29]:
aspects_phones = dic_aspect("aspects/aspects_phones.csv")
aspects_phones2 = extend_vocabulary_wordnet(aspects_phones)

print("Original\tcount_aspects: " , (len(aspects_phones)) , " count_terms: " , count_aspects(aspects_phones))
print("Extended\tcount_aspects: " , (len(aspects_phones2)) , " count_terms: " , count_aspects(aspects_phones2))
diff = count_aspects(aspects_phones2) - count_aspects(aspects_phones)
print("\tdiff: " , diff)
Original	count_aspects:  19  count_terms:  177
Extended	count_aspects:  19  count_terms:  388
	diff:  211
In [30]:
aspects_restaurants = dic_aspect("aspects/aspects_restaurants.csv")
aspects_restaurants2 = extend_vocabulary_wordnet(aspects_restaurants)

print("Original\tcount_aspects: " , (len(aspects_restaurants)) , " count_terms: " , count_aspects(aspects_restaurants))
print("Extended\tcount_aspects: " , (len(aspects_restaurants2)) , " count_terms: " , count_aspects(aspects_restaurants2))
diff = count_aspects(aspects_restaurants2) - count_aspects(aspects_restaurants)
print("\tdiff: " , diff)
Original	count_aspects:  40  count_terms:  343
Extended	count_aspects:  40  count_terms:  1197
	diff:  854
In [31]:
aspects_spas = dic_aspect("aspects/aspects_spas.csv")
aspects_spas2 = extend_vocabulary_wordnet(aspects_spas)

print("Original\tcount_aspects: " , (len(aspects_spas)) , " count_terms: " , count_aspects(aspects_spas))
print("Extended\tcount_aspects: " , (len(aspects_spas2)) , " count_terms: " , count_aspects(aspects_spas2))
diff = count_aspects(aspects_spas2) - count_aspects(aspects_spas)
print("\tdiff: " , diff)
Original	count_aspects:  40  count_terms:  332
Extended	count_aspects:  40  count_terms:  958
	diff:  626
In [32]:
aspects_videogames = dic_aspect("aspects/aspects_videogames.csv")
aspects_videogames2 = extend_vocabulary_wordnet(aspects_videogames)

print("Original\tcount_aspects: " , (len(aspects_videogames)) , " count_terms: " , count_aspects(aspects_videogames))
print("Extended\tcount_aspects: " , (len(aspects_videogames2)) , " count_terms: " , count_aspects(aspects_videogames2))
diff = count_aspects(aspects_videogames2) - count_aspects(aspects_videogames)
print("\tdiff: " , diff)
Original	count_aspects:  21  count_terms:  231
Extended	count_aspects:  21  count_terms:  417
	diff:  186

Now we print the aspects found in a sample review, just as we did in 2.1, for the spas and restaurants review datasets with their corresponding vocabularies.

We show an example with the first review of the Spas dataset:

In [33]:
data_spas[0].get('reviewText')
Out[33]:
"Good tattoo shop. Clean space, multiple artists to choose from and books of their work are available for you to look though and decide who's style most mirrors what you're looking for. I chose Jet to do a cover-up for me and he worked with me on the design and our ideas and communication flowed very well. He's a very personable guy, is friendly and keeps the conversation going while he's working on you, and he doesn't dick around (read: He starts to work and continues until the job is done). He's very professional and informative. Good customer service combines with talent at the craft."
In [34]:
aspectsS = get_review_apects(data_spas[0].get('reviewText'), aspects_spas)
aspectsS
Out[34]:
Aspect Term
0 cleanliness clean
1 service service
2 shopping shop

We show another example, this time with the first review of the Restaurants dataset:

In [35]:
data_restaurants[0].get('reviewText')
Out[35]:
'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.Anyway, I can\'t wait to go back!'
In [36]:
aspectsR = get_review_apects(data_restaurants[0].get('reviewText'), aspects_restaurants)
aspectsR
Out[36]:
Aspect Term
0 bread bread
1 breakfast breakfast
2 breakfast morning
3 building garden
4 eggs eggs
5 food ingredients
6 food food
7 menu menu
8 menu meal
9 seating sitting
10 staff waitress
11 vegetables vegetable

TASK 3

Task 3.1 - mandatory: loading Liu’s opinion lexicon composed of positive and negative words, accessible as an NLKT corpus, and exploiting it to assign the polarity values to aspect opinions in assignment 4. Instead of this lexicon, you are allowed to use others, such as SentiWordNet. See Appendix F

We first load Liu's lexicon, which contains positive- and negative-polarity words.

In [37]:
negativeWords = opinion_lexicon.negative()
positiveWords = opinion_lexicon.positive()
print(negativeWords)
print(len(negativeWords))
print(positiveWords)
print(len(positiveWords))
['2-faced', '2-faces', 'abnormal', 'abolish', ...]
4783
['a+', 'abound', 'abounds', 'abundance', 'abundant', ...]
2006

We create a first, basic function that returns a polarity in {-1, 0, 1} depending on whether the word is positive, negative, or unknown.

In [38]:
def polarity_word(word):
    if word in positiveWords:
        return 1
    elif word in negativeWords:
        return -1
    return 0

We test it; note that a word like very is considered neutral.

In [39]:
print('great:', polarity_word('great'))
print('awesome:', polarity_word('awesome'))
print('ugly:', polarity_word('ugly'))
print('very:', polarity_word('very'))
great: 1
awesome: 1
ugly: -1
very: 0
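A known limitation of word-level polarity is negation: "not great" would score as positive. A minimal sketch of a negation flip over adjacent tokens, reusing a word-polarity function like the one above (the toy lexicon here is illustrative only):

```python
NEGATORS = {"not", "no", "never", "n't"}

def polarity_with_negation(tokens, polarity_fn):
    # Flip a word's polarity when the previous token is a negator.
    total = 0
    for i, tok in enumerate(tokens):
        p = polarity_fn(tok)
        if p != 0 and i > 0 and tokens[i - 1] in NEGATORS:
            p = -p
        total += p
    return total

toy_lexicon = {"great": 1, "ugly": -1}
polarity_with_negation(["not", "great"], lambda w: toy_lexicon.get(w, 0))  # -> -1
```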
In [40]:
# TODO: remove this if we end up not using it.

We load the VADER opinion lexicon, which includes emojis, abbreviations, and acronyms.

In [41]:
f = nltk.data.load("vader_lexicon/vader_lexicon.txt")
lexicon = {}
for line in f.split("\n"):
    if not line.strip():
        continue  # skip empty lines
    (word, polarity) = line.strip().split("\t")[0:2]
    lexicon[word] = float(polarity)

We extend the basic polarity function to fall back on the VADER lexicon. Since this lexicon returns polarities on a different scale, we map its score to its sign so that the function still returns values in {-1, 1}.

In [43]:
def polarity_word_Vader(word):
    if word in positiveWords:
        return 1
    elif word in negativeWords:
        return -1
    elif word in lexicon:
        if lexicon[word] >= 0:
            return 1
        else:
            return -1
    return 0
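The fallback above keeps only the sign of the VADER score. If we instead wanted a graded value in [-1, 1], VADER valences range roughly over [-4, 4], so a true normalization (under that range assumption) could be sketched as:

```python
def normalize_vader(score, max_abs=4.0):
    # Clamp and scale a raw VADER valence score into [-1, 1].
    return max(-1.0, min(1.0, score / max_abs))

normalize_vader(3.1)   # ~0.775
normalize_vader(-4.0)  # -1.0
```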
In [44]:
text = data_hotel[0].get('reviewText')
for word in re.findall(r"[\w']+", text):
    polarity = polarity_word_Vader(word)
    if polarity != 0:
        print('term: ', word, '\tpolarity: ', polarity)
term:  awesome 	polarity:  1
term:  lobby 	polarity:  1
term:  great 	polarity:  1

Combining both lexicons, we define a function to obtain the polarity of a text.

In [45]:
def count_polarity_words(text):
    count = 0
    words = 0

    for word in re.findall(r"[\w']+", text):
        if word in positiveWords:
            count += 1
            words += 1

        elif word in negativeWords:
            count -= 1
            words += 1

        elif word in lexicon:
            # VADER valences range roughly over [-4, 4], so the
            # average can fall outside [-1, 1]
            count += lexicon[word]
            words += 1

    if words == 0:
        return 0  # no polarity-bearing words found
    return count / words

We show an example, again taking the first Hotels review:

In [46]:
data_hotel[0].get('reviewText')
Out[46]:
"Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!"
In [47]:
count_polarity_words(data_hotel[0].get('reviewText'))
Out[47]:
0.7000000000000001

We define a function that gives the average polarity per review for a set of reviews, i.e., whether a review is loaded with (good/bad) sentiments and value judgments about different aspects. The higher the polarity value, the more positive the review: a negative polarity indicates a critical review, a polarity close to 0 generally indicates a neutral review, and a polarity close to one indicates a very positive assessment. Think of the difference between a film critic's review and a friend's comment on the same film.

In this last example the polarity is positive, since fragments such as "Great hotel", "awesome", "Great boutique rooms", "Awesome pool", "GREAT rooftop patio bar" and "A great place to stay" push the opinion toward a positive polarity. However, it does not reach an absolute 1, because fragments such as "a very very busy lobby with Gallo Blanco attached" penalize the opinion.
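To make the scale concrete, here is a minimal, illustrative sketch that maps a score in [-1, 1] to a coarse label. The `polarity_label` helper and the ±0.2 threshold are our own choices for illustration, not part of the assignment:

```python
def polarity_label(score, threshold=0.2):
    """Map a polarity score in [-1, 1] to a coarse label (illustrative thresholds)."""
    if score > threshold:
        return 'positive'
    if score < -threshold:
        return 'negative'
    return 'neutral'

print(polarity_label(0.7))    # the example review above -> positive
print(polarity_label(-0.5))   # negative
print(polarity_label(0.05))   # neutral
```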

We now write an auxiliary function that evaluates several of the opinions:

In [48]:
def count_polarity_data(data_list):
    count = 0
    for data in data_list:
        pol = count_polarity_words(data.get('reviewText'))
        count += pol
        print("reviewerID:", data.get('reviewerID'), "got a polarity of ", pol)
    print("Avg polarity: " , count/len(data_list))

As an example, we take the first 10 opinions from the Hotels dataset, show the polarities of these ten reviews, and compute their average. What we find is that the first ten opinions are mostly positive.

In [49]:
count_polarity_data(data_hotel[:10])
reviewerID: qLCpuCWCyPb4G2vN-WZz-Q got a polarity of  0.7000000000000001
reviewerID: rVlgz-MGYRPa8UzTYO0RGQ got a polarity of  0.5166666666666667
reviewerID: 4o7r-QSYhOkxpxRMqpXcCg got a polarity of  0.7291666666666666
reviewerID: msgAEWFbD4df0EvyOR3TnQ got a polarity of  0.8333333333333334
reviewerID: 0CMz8YaO3f8xu4KqQgKb9Q got a polarity of  0.48571428571428565
reviewerID: r5uiIxwJ-I-oHBkNY2Ha3Q got a polarity of  0.18947368421052635
reviewerID: zw-bIcZP4_VEi3UetomDeg got a polarity of  1.1272727272727272
reviewerID: 2rlBbFPHyZjXSFSE8r551w got a polarity of  0.9304347826086956
reviewerID: hw9irUmYqNe5zSW1V1ybbw got a polarity of  0.6888888888888888
reviewerID: 8PSUiFWz11tGdYy9xEFPZA got a polarity of  0.4076923076923077
Avg polarity:  0.6608643343054099

Even so, these word lists are not the only way to compute polarity; there are other methods we could explore, such as:

In [50]:
#https://stackoverflow.com/questions/38263039/sentiwordnet-scoring-with-python
print(list(swn.senti_synsets('happy')))
polarity = swn.senti_synset('happy.a.01')
print('pos', polarity.pos_score(), 'neg', polarity.neg_score())
[SentiSynset('happy.a.01'), SentiSynset('felicitous.s.02'), SentiSynset('glad.s.02'), SentiSynset('happy.s.04')]
pos 0.875 neg 0.0

We create a new auxiliary function based on the SentiWordNet corpus, to see whether it changes our previous results:

In [51]:
def count_polaritySentiment_words(text):
    count = 0
    words = 0

    for word in re.findall(r"[\w']+", text):
        sentiment = list(swn.senti_synsets(word)) 
        if(any(sentiment)):
            count += sentiment[0].pos_score()
            count -= sentiment[0].neg_score()
            words += 1
            
    # Guard against reviews containing no SentiWordNet words (avoid division by zero)
    return count / words if words else 0
In [52]:
data_hotel[0].get('reviewText')
Out[52]:
"Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!"
In [53]:
count_polaritySentiment_words(data_hotel[0].get('reviewText'))
Out[53]:
0.026923076923076925

This time the polarity of the sentence is much closer to 0, which denotes a certain neutrality. This is mainly because this library takes more words into account than our home-made implementation, so many (mostly neutral) words pull the overall polarity toward a more neutral value.

On the other hand, our method assigned values of +1/-1 depending on whether the word was positive or negative, whereas SentiWordNet does not make such an extreme assessment.
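This dilution effect can be illustrated with plain numbers (toy scores, not taken from the actual lexicons): averaging in many neutral words shrinks the mean even when the opinion words are strongly positive:

```python
# Toy illustration: averaging in neutral (0-score) words shrinks the mean.
opinion_scores = [1, 1, 1]                 # only opinion words counted (our method)
with_neutral = opinion_scores + [0] * 27   # SentiWordNet-style: neutral words counted too

print(sum(opinion_scores) / len(opinion_scores))  # 1.0
print(sum(with_neutral) / len(with_neutral))      # 0.1
```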

In [54]:
def count_polaritySentiment_words_data(data_list):
    count = 0
    for data in data_list:
        pol = count_polaritySentiment_words(data.get('reviewText'))
        count += pol
        print("reviewerID:", data.get('reviewerID'), "got a polarity of ", pol)
    print("Avg polarity: " , count/len(data_list))
In [55]:
count_polaritySentiment_words_data(data_hotel[:10])
reviewerID: qLCpuCWCyPb4G2vN-WZz-Q got a polarity of  0.026923076923076925
reviewerID: rVlgz-MGYRPa8UzTYO0RGQ got a polarity of  0.01896551724137931
reviewerID: 4o7r-QSYhOkxpxRMqpXcCg got a polarity of  0.03798342541436464
reviewerID: msgAEWFbD4df0EvyOR3TnQ got a polarity of  0.03773584905660377
reviewerID: 0CMz8YaO3f8xu4KqQgKb9Q got a polarity of  0.004251700680272109
reviewerID: r5uiIxwJ-I-oHBkNY2Ha3Q got a polarity of  0.04369918699186992
reviewerID: zw-bIcZP4_VEi3UetomDeg got a polarity of  0.013297872340425532
reviewerID: 2rlBbFPHyZjXSFSE8r551w got a polarity of  0.008653846153846154
reviewerID: hw9irUmYqNe5zSW1V1ybbw got a polarity of  -0.025
reviewerID: 8PSUiFWz11tGdYy9xEFPZA got a polarity of  0.02533783783783784
Avg polarity:  0.019184831263967618

Just as we suspected, this new method stays closer to neutrality, so it does not give us information as interesting as the first implementation; perhaps Liu's lexicon is better suited to this approach.

Task 3.2 - optional [low/medium difficulty]: considering modifiers to adjust the polarity values of the aspect opinions in Assignment 4. The modifiers to use could be those provided with the NLTK Sentiment Analyzer (see Appendix G) and/or those given in modifiers.csv

We try this new version of the polarity computation, following Appendix G. As usual, we first run a test on the first review of the Hotels dataset.

In [56]:
constants = VaderConstants()
print(constants.BOOSTER_DICT)
analyzer = SentimentIntensityAnalyzer()
{'absolutely': 0.293, 'amazingly': 0.293, 'awfully': 0.293, 'completely': 0.293, 'considerably': 0.293, 'decidedly': 0.293, 'deeply': 0.293, 'effing': 0.293, 'enormously': 0.293, 'entirely': 0.293, 'especially': 0.293, 'exceptionally': 0.293, 'extremely': 0.293, 'fabulously': 0.293, 'flipping': 0.293, 'flippin': 0.293, 'fricking': 0.293, 'frickin': 0.293, 'frigging': 0.293, 'friggin': 0.293, 'fully': 0.293, 'fucking': 0.293, 'greatly': 0.293, 'hella': 0.293, 'highly': 0.293, 'hugely': 0.293, 'incredibly': 0.293, 'intensely': 0.293, 'majorly': 0.293, 'more': 0.293, 'most': 0.293, 'particularly': 0.293, 'purely': 0.293, 'quite': 0.293, 'really': 0.293, 'remarkably': 0.293, 'so': 0.293, 'substantially': 0.293, 'thoroughly': 0.293, 'totally': 0.293, 'tremendously': 0.293, 'uber': 0.293, 'unbelievably': 0.293, 'unusually': 0.293, 'utterly': 0.293, 'very': 0.293, 'almost': -0.293, 'barely': -0.293, 'hardly': -0.293, 'just enough': -0.293, 'kind of': -0.293, 'kinda': -0.293, 'kindof': -0.293, 'kind-of': -0.293, 'less': -0.293, 'little': -0.293, 'marginally': -0.293, 'occasionally': -0.293, 'partly': -0.293, 'scarcely': -0.293, 'slightly': -0.293, 'somewhat': -0.293, 'sort of': -0.293, 'sorta': -0.293, 'sortof': -0.293, 'sort-of': -0.293}
In [57]:
data_hotel[0].get('reviewText')
Out[57]:
"Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!"
In [58]:
analyzer.polarity_scores(data_hotel[0].get('reviewText'))
Out[58]:
{'neg': 0.0, 'neu': 0.675, 'pos': 0.325, 'compound': 0.99}

In this case we see that the method is able to analyse an entire sentence and give a final score broken down into several parts:

  • neg: negative polarity of the text
  • neu: neutral polarity of the text
  • pos: positive polarity of the text
  • compound: sum of the positive, negative and neutral polarities, normalized between -1 and 1

Of these values, the one that interests us most is compound, which synthesizes the other three; we will use it in place of our polarity function in 4.3.
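For reference, NLTK's VADER computes compound by summing the valence of each token (with modifier adjustments) and normalizing the raw sum x to [-1, 1] as x / sqrt(x² + α), with α = 15. A quick sketch of that normalization step:

```python
import math

def vader_normalize(x, alpha=15):
    # VADER's normalization: maps the raw summed valence to [-1, 1]
    return x / math.sqrt(x * x + alpha)

print(round(vader_normalize(10), 4))   # strongly positive raw sum -> 0.9325
print(round(vader_normalize(-3), 4))   # mildly negative raw sum -> -0.6124
```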

TASK 4

Task 4.1 - mandatory: extracting the [aspect, aspect term, opinion word, polarity] tuples from the input reviews

POS Tagging

We define a function that applies POS tagging to a text (a review); this is how we will identify the adjectives.

In [59]:
def pos_tagging(text):
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(s) for s in sentences]
    sentences = [nltk.pos_tag(s) for s in sentences]
    return sentences

Let us see it with the example used earlier:

In [60]:
postagged_sentences = pos_tagging(reviews[0].get("reviewText").lower())
print(reviews[0].get("reviewText"))
print(postagged_sentences[0])
print(postagged_sentences[1])
Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!
[('great', 'JJ'), ('hotel', 'NN'), ('in', 'IN'), ('central', 'JJ'), ('phoenix', 'NN'), ('for', 'IN'), ('a', 'DT'), ('stay-cation', 'NN'), (',', ','), ('but', 'CC'), ('not', 'RB'), ('necessarily', 'RB'), ('a', 'DT'), ('place', 'NN'), ('to', 'TO'), ('stay', 'VB'), ('out', 'IN'), ('of', 'IN'), ('town', 'NN'), ('and', 'CC'), ('without', 'IN'), ('a', 'DT'), ('car', 'NN'), ('.', '.')]
[('not', 'RB'), ('much', 'JJ'), ('around', 'IN'), ('the', 'DT'), ('area', 'NN'), (',', ','), ('and', 'CC'), ('unless', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('familiar', 'JJ'), ('with', 'IN'), ('downtown', 'NN'), (',', ','), ('i', 'NN'), ('would', 'MD'), ('rather', 'RB'), ('have', 'VB'), ('a', 'DT'), ('guest', 'NN'), ('stay', 'NN'), ('in', 'IN'), ('old', 'JJ'), ('town', 'NN'), ('scottsdale', 'NN'), (',', ','), ('etc', 'FW'), ('.', '.')]

We define a function that returns the adjectives of a given text (in our case we will use it on sentences):

In [61]:
def get_adjetivos_sentence(sentence):
    
    s = nltk.word_tokenize(sentence)
    adjetivos = [(w,tipo) for w,tipo in nltk.pos_tag(s)  if tipo == 'JJ']
    return adjetivos
In [62]:
test_sentence = 'Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car.'
get_adjetivos_sentence(test_sentence)
Out[62]:
[('Great', 'JJ')]

Syntactic analysis of sentences

We consider the basic grammar "adjective" + "common/proper noun", since experimentally it gave us very good results when detecting aspect descriptions.

We define the following function that, given the text of a review and an aspect vocabulary, produces a summary of the polarity of the aspects described in it (see the example below). To do so, we consider the polarity of the adjectives that precede terms belonging to our aspect vocabulary, combining NLTK's POS tagging and syntactic analysis.
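Before reading the full function, the chunking step can be seen in isolation. A minimal sketch over a hand-tagged fragment (so no tagger model is needed) shows how `nltk.RegexpParser` groups adjective + noun sequences under the JJNN label:

```python
import nltk

grammar = r"""
 JJNN: {<JJ>*<NN>+}
       {<JJ>*<NNP>+}
"""
cp = nltk.RegexpParser(grammar)

# Hand-tagged fragment of the example review
tagged = [('great', 'JJ'), ('hotel', 'NN'), ('in', 'IN'),
          ('central', 'JJ'), ('phoenix', 'NN')]
tree = cp.parse(tagged)

# Keep only the JJNN chunks found by the grammar
chunks = [child.leaves() for child in tree if isinstance(child, nltk.Tree)]
print(chunks)
# [[('great', 'JJ'), ('hotel', 'NN')], [('central', 'JJ'), ('phoenix', 'NN')]]
```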

In [63]:
def get_review_resume(review_text, aspect_vocabulary):
    ''' Returns the summary of a review, like the initial example of the assignment
            (only for the aspects available in the given vocabulary).
        Args:
            review_text: text of the review
            aspect_vocabulary: aspect vocabulary
        Returns:
            rev_resume: df with the aspect valorations found
    '''

    # Grammar used to detect adjectives modifying nouns.
    grammar = r"""
     JJNN: {<JJ>*<NN>+} # chunk adjective + noun sequences
     {<JJ>*<NNP>+} # chunk sequences of proper nouns
    """
    cp = nltk.RegexpParser(grammar)

    # Get the aspects present in the review
    aspectsH = get_review_apects(review_text, aspect_vocabulary)

    rows = []
    row = []

    for pos in pos_tagging(review_text.lower()):  # POS TAGGING
        chunk_parse = cp.parse(pos)

        for child in chunk_parse:
            if isinstance(child, nltk.Tree):
                if child.label() == 'JJNN':  # JJNN: our grammar
                    word = ""
                    adjective = ""
                    polarity = 0
                    aspects = []

                    # Explore the possible cases allowed by our grammar
                    for i in range(len(child)):
                        if child[i][1] == 'JJ':  # adjective
                            # compute the adjective's polarity in case it is attached to a known term
                            polarity += polarity_word(child[i][0])
                            adjective += child[i][0] + " "

                        if child[i][1] == "NN" or child[i][1] == 'NNP':  # noun
                            # check whether the word is an aspect term
                            word += child[i][0] + " "
                            aux = aspectsH.loc[aspectsH['Term'] == child[i][0]]['Aspect']
                            if not aux.empty:  # we have an aspect for this word
                                aspects.append(aux.values[0] + " ")

                    # Polarity Aspect Adjective Word
                    if any(aspects):
                        # Take the aspect of the last word, e.g.: rooftop patio bar --> bar
                        row = [polarity, aspects[-1], adjective, word]
                    else:
                        row = [polarity, '', adjective, word]

                rows.append(row)

    rev_resume = pd.DataFrame(rows, columns=['Polarity', 'Aspect', 'Adjective', 'Word'])
    # Return ONLY the rows that contain an aspect and an adjective from our vocabulary
    df = rev_resume.loc[(rev_resume['Aspect'] != '') & (rev_resume['Adjective'] != '')].reset_index(drop=True)
    return df
    

We test our function on the example review used throughout the notebook:

In [64]:
data_hotel[0].get('reviewText')
Out[64]:
"Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!"
In [65]:
get_review_resume(data_hotel[0].get('reviewText'), aspects_hotel)
Out[65]:
Polarity Aspect Adjective Word
0 1 shopping great boutique
1 1 pool awesome pool
2 1 bar great rooftop patio bar
3 0 building busy lobby

We now try the WordNet-extended vocabulary from section 2.2:

In [66]:
get_review_resume(data_hotel[0].get('reviewText'), aspects_hotel_extended_wordnet)
Out[66]:
Polarity Aspect Adjective Word
0 1 building great hotel
1 1 shopping great boutique
2 1 pool awesome pool
3 1 bar great rooftop patio bar
4 0 building busy lobby

We see that we gained one more row in the table thanks to the extended vocabulary, which is what we want in order to later extract information about a hotel from all the reviews written about it.

Task 4.2 - optional [low/medium difficulty]: extracting the [aspect, aspect term, opinion word, modifier, polarity] tuples from the input reviews, taking the modifiers of assignment 3.2

In [67]:
def get_review_resumeSIA(review_text, aspect_vocabulary):
    ''' Returns the summary of a review, like the initial example of the assignment
            (only for the aspects available in the given vocabulary), using the VADER
            SentimentIntensityAnalyzer for the adjective polarities.
        Args:
            review_text: text of the review
            aspect_vocabulary: aspect vocabulary
        Returns:
            rev_resume: df with the aspect valorations found
    '''

    # Grammar used to detect adjectives modifying nouns.
    grammar = r"""
     JJNN: {<JJ>*<NN>+} # chunk adjective + noun sequences
     {<JJ>*<NNP>+} # chunk sequences of proper nouns
    """
    cp = nltk.RegexpParser(grammar)

    # Get the aspects present in the review
    aspectsH = get_review_apects(review_text, aspect_vocabulary)

    rows = []
    row = []

    for pos in pos_tagging(review_text.lower()):  # POS TAGGING
        chunk_parse = cp.parse(pos)

        for child in chunk_parse:
            if isinstance(child, nltk.Tree):
                if child.label() == 'JJNN':  # JJNN: our grammar
                    word = ""
                    adjective = ""
                    polarity = 0
                    aspects = []

                    # Explore the possible cases allowed by our grammar
                    for i in range(len(child)):
                        if child[i][1] == 'JJ':  # adjective
                            # compute the adjective's polarity with VADER
                            polarity += analyzer.polarity_scores(child[i][0])['compound']
                            # previously: polarity_word(child[i][0])
                            adjective += child[i][0] + " "

                        if child[i][1] == "NN" or child[i][1] == 'NNP':  # noun
                            # check whether the word is an aspect term
                            word += child[i][0] + " "
                            aux = aspectsH.loc[aspectsH['Term'] == child[i][0]]['Aspect']
                            if not aux.empty:  # we have an aspect for this word
                                aspects.append(aux.values[0] + " ")

                    # Polarity Aspect Adjective Word
                    if any(aspects):
                        # Take the aspect of the last word, e.g.: rooftop patio bar --> bar
                        row = [polarity, aspects[-1], adjective, word]
                    else:
                        row = [polarity, '', adjective, word]

                rows.append(row)

    rev_resume = pd.DataFrame(rows, columns=['Polarity', 'Aspect', 'Adjective', 'Word'])
    # Return ONLY the rows that contain an aspect and an adjective from our vocabulary
    df = rev_resume.loc[(rev_resume['Aspect'] != '') & (rev_resume['Adjective'] != '')].reset_index(drop=True)
    return df
    
In [68]:
data_hotel[0].get('reviewText')
Out[68]:
"Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!"
In [69]:
get_review_resumeSIA(data_hotel[0].get('reviewText'), aspects_hotel)
Out[69]:
Polarity Aspect Adjective Word
0 0.6249 shopping great boutique
1 0.6249 pool awesome pool
2 0.6249 bar great rooftop patio bar
3 0.0000 building busy lobby
In [70]:
get_review_resumeSIA(data_hotel[0].get('reviewText'), aspects_hotel_extended_wordnet)
Out[70]:
Polarity Aspect Adjective Word
0 0.6249 building great hotel
1 0.6249 shopping great boutique
2 0.6249 pool awesome pool
3 0.6249 bar great rooftop patio bar
4 0.0000 building busy lobby

The only thing this method changes is that we use a softer polarity computation, so the sentence polarities are more "gray" and not as extreme as in the first method. Opinions are normally not so drastically black/white, so these values match reality better than our home-made method.

TASK 5

Task 5.1 - mandatory: visualizing on screen the aspect opinions (tuples) of a given review

This visualization is the one we already produced in section 4.1 to inspect the output of our implementation. For the notebook's organization we repeat it here with the extended vocabulary (see section 4.1 for more details).

REMEMBER: our polarity is given in the interval [-1, 1].

In [71]:
review = data_hotel[0]
In [72]:
review_text = review.get('reviewText')
get_review_resume(review_text, aspects_hotel_extended_wordnet)
Out[72]:
Polarity Aspect Adjective Word
0 1 building great hotel
1 1 shopping great boutique
2 1 pool awesome pool
3 1 bar great rooftop patio bar
4 0 building busy lobby

Task 5.2 - mandatory: visualizing on screen a summary of the aspect opinions of a given item. Among other issues, the total number of positive/negative opinions for each aspect of the item could be visualized

Now, for each hotel, using the set of all the reviews written about it, we give the polarity of each aspect. To do so, we define the following function, which given a hotel returns the summary we want.

In addition, for each item (hotel) we look for the top N terms (with an associated aspect) mentioned about it. This is useful to see what people notice most about a particular place. Ties are broken arbitrarily.

NOTE: plotly version '4.14.3' was used for the charts.
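The top-N computation below relies on `collections.Counter.most_common`, which sorts by count (ties keep first-seen order in CPython). A quick sketch with toy terms:

```python
from collections import Counter

# Toy term list (not real data) to show how the top-N terms are obtained
terms = ['room', 'pool', 'room', 'staff', 'pool', 'room']
print(Counter(terms).most_common(2))  # [('room', 3), ('pool', 2)]
```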

In [73]:
from collections import Counter


def aux_fun_add_polarity_to_aspect(aspect, polarity, aspects_polarity):

    # Note: the membership test must be on the aspect itself
    if aspect not in aspects_polarity:
        aspects_polarity[aspect] = []

    aspects_polarity[aspect].append(polarity)


def aspect_opinion_by_item(item_id, reviews,aspect_vocabulary, debug = False, sort_by_ascending_plarity = False, n_most_common_words=5):
    '''
        Args:
            item_id: hotel id 
            reviews: reviews (file loaded into the notebook)
            aspect_vocabulary: aspect vocabulary
            n_most_common_words: get the top N common terms (with an aspect identified) mentioned about the item
        Returns:
            resume (pandas DataFrame): summary for a particular hotel, grouped by aspect
    '''
    # Filter by item
    reviews_of_item = [r for r in reviews if r.get('asin') == item_id ]
    print('NÚMERO TOTAL DE REVIEWS:', len(reviews_of_item))
    review_dfs = [get_review_resume(r.get('reviewText'), aspect_vocabulary) for r in reviews_of_item ]
    #df contains all rows from all item reviews
    df = pd.concat(review_dfs)
    
    if(debug):#For testing purpose
        print('------DEBUG----------------------------------------------------------')
        print('reviews:', len(reviews_of_item),'\n  :')
        a = [print(r.get('reviewText'), '\n\n')for r in reviews_of_item]
        print('all aspect concated for this hotel:')
        print(df)
        '''
        print('POLARIDADES')
        print('POSITIVAS\n',df_postive_polarity )
        print()
        print('NEGATIVAS\n',df_negative_polarity )
        print()
        print('NEUTRALES\n',df_neutral_polarity )
        print()
        '''
        print('--------------------------------------------------------------------\n\n\n')
        
    # Getting the top N mentioned terms
    terms = df['Word'].values.tolist()
    most_commmon_terms = Counter(terms).most_common(n_most_common_words)
    
    df = df[['Polarity', 'Aspect']]
    
    # Counting polarity
    df_postive_polarity =df.loc[df['Polarity'] > 0].groupby('Aspect').agg(['count'])
    df_postive_polarity = df_postive_polarity.T.reset_index(drop=True).T
    df_postive_polarity = df_postive_polarity.rename(columns={0: "Referencias_Positivas"})

    df_negative_polarity =df.loc[df['Polarity'] < 0].groupby('Aspect').agg(['count'])
    df_negative_polarity = df_negative_polarity.T.reset_index(drop=True).T
    df_negative_polarity = df_negative_polarity.rename(columns={0: "Referencias_Negativas"})

    df_neutral_polarity =df.loc[df['Polarity'] == 0].groupby('Aspect').agg(['count'])
    df_neutral_polarity = df_neutral_polarity.T.reset_index(drop=True).T
    df_neutral_polarity = df_neutral_polarity.rename(columns={0: "Referencias_Neutras"})

    
    #Merging
    df_resume = df.groupby('Aspect').agg(Polaridad_Media = ('Polarity','mean'),Total_Referencias = ('Polarity','count'))
    df_resume = df_resume.merge(df_postive_polarity, on='Aspect', how='left')
    df_resume = df_resume.merge(df_negative_polarity, on='Aspect', how='left')
    df_resume = df_resume.merge(df_neutral_polarity, on='Aspect', how='left')
    df_resume = df_resume.fillna(0)
    
    df_resume = df_resume.astype({'Referencias_Positivas': 'int32'})
    df_resume = df_resume.astype({'Referencias_Negativas': 'int32'})
    df_resume = df_resume.astype({'Referencias_Neutras': 'int32'})


    if(sort_by_ascending_plarity):
        return df_resume.sort_values(by='Polaridad_Media', ascending=False), most_commmon_terms

    else:
        return df_resume, most_commmon_terms

We also define a function to visualize the number of positive/negative/neutral opinions as a bar chart broken down by aspect. The chart is interactive, allowing several controls such as zooming in or taking snapshots:

In [74]:
import plotly.graph_objects as go


def plot_polarity_per_aspects(df_resumer_per_item):
    '''Function to plot aspect_opinion_by_item() resume '''

    df = df_resumer_per_item
    x = df.index.to_list()
    fig = go.Figure()
    y1 = df['Referencias_Positivas'].to_list()
    y2 = df['Referencias_Negativas'].to_list()
    y3 = df['Referencias_Neutras'].to_list()

    
    fig.add_trace(go.Bar(
        x=x,
        y=y1,
        name='Nº Referencias Positivas',
        marker_color='green'
    ))
    
    fig.add_trace(go.Bar(
        x=x,
        y=y2,
        name='Nº Referencias Negativas',
        marker_color='red'
    ))
    
    fig.add_trace(go.Bar(
        x=x,
        y=y3,
        name='Nº Referencias Neutras',
        marker_color='gray'
    ))

    # Here we modify the tickangle of the xaxis, resulting in rotated labels.
    fig.update_layout(barmode='group', xaxis_tickangle=-45)
    fig.show()
    
    

We define a function to display the most mentioned words in a pie chart:

In [75]:
def plot_pie_chart(most_commmon_terms):
    # Unpack (term, count) pairs for the pie chart
    counts = [c for (term, c) in most_commmon_terms]
    terms = [term for (term, c) in most_commmon_terms]
    fig = px.pie(values=counts, names=terms, title='Most mentioned terms')
    fig.show()

Demo 1:

To demonstrate how this works, we pick a hotel with 5 reviews, show them, and show the summary we build from them:

In [76]:
item = 'CYMG5AsrhkhUPro2c6NSUA'
resume_1, most_commmon_terms_1 = aspect_opinion_by_item(item, data_hotel,aspects_hotel_extended_wordnet,debug = True)
resume_1
NÚMERO TOTAL DE REVIEWS: 5
------DEBUG----------------------------------------------------------
reviews: 5 
  :
I dug this place. Got a great deal from Priceline, and I pretty much knew what to expect going in. I've stayed at places like this before across the country, and I can definitely say that this was a very pleasant experience. Everyone has complained about the beds being hard, but I didn't really find this to be the case. I like a firmer mattress, so I guess that's just my personal preference, but it wasn't like sleeping on a campus futon or anything. The towels were nice- nothing fancy, but definitely not the thin washed-out wisps of cloth that you might have expected at other places. The setup of the place was great- it's just like having a little studio apartment. Even had a coffeemaker, and an ironing board, which is strangely becoming a rarity these days.I have to make special mention of one thing- the pillows, for some reason, were absolutely amazing. My girlfriend and I both agreed (and we're both flight attendants, so we KNOW hotels) that these were some of the best hotel pillows we've ever slept on. Couldn't exactly tell you why, but there was just something about them- proper amount of spring, cushioning, support, punchability, etc. If you need to be in Arizona on business, you couldn't pick a better place. The convenience is amazing- you're right near the highway and there's quite a bit of stuff within walking distance. We got some DELICIOUS takeout from some of the local restaurants, and it was SUPER Handy to have a microwave and kitchenette available within the room for our use!So yeah, I gotta say, I didn't mind this place one tiny little bit. If you're used to Marriotts, Hyatts, and other fancy hotels, then you need to know that this isn't one of them. But for a relatively no-frills "home away from home" type of place, this is perfect!! 


Our Hendersonville Video team arrived midday and everyone was checked in within a few minutes. This is like a two or three star hotel and is a good value for the money. Great parking and a nice desk staff. Ihop is only a block away. Very handy. 


I found this place on Hotels.com and booked one night so I could fly out of Phoenix early the next morning. Got in at 12 and asked if they could check me in early (I figured this was reasonable since I saw the maids already in the process of cleaning two rooms) - and was told the earliest I could check in was 3 p.m.?? Really? Now, I know they don't bend over backwards for anybody who gets a deal over Hotels.com since they're not making a premium amount........but you have rooms available, dude. "We were really busy last night and won't have anything until 3". Not even 2, 2:30? "No, earliest is 3 unfortunately". So based on them being filthy liars...........1 star.In addition, the entire room was in need of a complete overhaul. From the toilet that wasn't completely caulked in, to the leaky shower that points in the direction of the curtain and doesn't move, and the fact that you have to swipe your room key "exactly right" to open the damn door. Yes, you heard me......it literally took me 10 tries to get into the room. The front desk just laughed and said "yeah, we have the same problem". I mean, you don't have to fix it or anything........So based on them being idiots as well........1 star.For a grand average of......1star. 


It was a nice clean room. But I got in late about 11pm so the 3rd shift Guy was on duty. Well they lock the doors before 11 so he answer with a horrible attitude. He asked if I had a reservation I was like "duh" I'm here at this time of night. He let me in so we can do checks in. He (Billy) had the worse personality if my reservation had not been prepaid I'd left. I got to bed late so I didn't till about 11:20am oh you would have thought the earth had shaken 2people came by my room told me check out was at 11am. Then I was told they give me to noon. Thank you for the courtesy which should have been a given considering the economic situation Phoenix is in. Not like they were crawling with customers. So glad I'm changing to the embassy suites. Last night was just a crash pad for I got to Phoenix early. 


I checked in for a two night stay, so I could hit a local nightclub and not get in trouble with the popo. My check in was smooth and all staff members were very nice and made my stay comfortable. The room was setup like a small studio apartment; nothing to sing about, but it was very clean. Overall, I would recommend this place. 


all aspect concated for this hotel:
   Polarity      Aspect    Adjective                Word
0         0   building         star               hotel 
1         1      price         good               value 
2         1    parking        great             parking 
3         1      staff         nice          desk staff 
0         0  breakfast         next             morning 
1         0        bar       entire                room 
0         2        bar   nice clean                room 
1        -1    service     horrible            attitude 
2         0   location     economic   situation phoenix 
--------------------------------------------------------------------



Out[76]:
Polaridad_Media Total_Referencias Referencias_Positivas Referencias_Negativas Referencias_Neutras
Aspect
bar 1 2 1 0 1
breakfast 0 1 0 0 1
building 0 1 0 0 1
location 0 1 0 0 1
parking 1 1 1 0 0
price 1 1 1 0 0
service -1 1 0 1 0
staff 1 1 1 0 0
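The per-aspect summary above can be reproduced with a pandas `groupby` over the per-mention table. A minimal sketch, with the mention data taken from the printed table and the column names assumed from the notebook's output:

```python
import pandas as pd

# Per-mention polarities, as printed in the aspect table above
mentions = pd.DataFrame({
    "Aspect": ["building", "price", "parking", "staff", "breakfast",
               "bar", "bar", "service", "location"],
    "Polarity": [0, 1, 1, 1, 0, 0, 2, -1, 0],
})

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-mention polarities into the per-aspect summary table."""
    grouped = df.groupby("Aspect")["Polarity"]
    return pd.DataFrame({
        "Polaridad_Media": grouped.mean(),
        "Total_Referencias": grouped.size(),
        "Referencias_Positivas": grouped.apply(lambda s: int((s > 0).sum())),
        "Referencias_Negativas": grouped.apply(lambda s: int((s < 0).sum())),
        "Referencias_Neutras": grouped.apply(lambda s: int((s == 0).sum())),
    })

resume = summarize(mentions)
```

With these nine mentions, `resume` matches the Out[76] table: e.g. `bar` has two references (polarities 0 and 2) and a mean polarity of 1.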
In [77]:
plot_polarity_per_aspects(resume_1)

The figure above is useful for seeing at a glance that the reviews are positive about the price and the bar, while some services receive a negative mention. Although this gives us a general picture of the reviews, we can see there is some controversy regarding the service and the staff.

Let's look at the most commonly mentioned terms:

In [78]:
plot_pie_chart(most_commmon_terms_1)

We see that room, which is mentioned twice, takes the top spot, something that makes sense given that we are looking at hotel reviews.

USAGE EXAMPLE:

Using the same hotel that appears in review 0, which we have been using as a reference, we find 195 reviews; here we extract its polarity summary:

We also sort the aspects by polarity, most positive first, to see the hotel's strong points:
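Sorting the aspects by polarity amounts to a `DataFrame.sort_values` on the mean-polarity column; a minimal sketch with illustrative values:

```python
import pandas as pd

# Illustrative subset of a per-aspect summary table
resume = pd.DataFrame(
    {"Polaridad_Media": [0.296, 0.700, -1.000]},
    index=pd.Index(["bar", "bathrooms", "cleanliness"], name="Aspect"),
)

# Best-rated aspects first, so strong points read off the top
strengths = resume.sort_values("Polaridad_Media", ascending=False)
```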

In [79]:
item = '8ZwO9VuLDWJOXmtAdc7LXQ'
resume_2, most_commmon_terms_2 = aspect_opinion_by_item(item, data_hotel, aspects_hotel_extended_wordnet, debug=False, sort_by_ascending_plarity=True)
resume_2
TOTAL NUMBER OF REVIEWS: 195
Out[79]:
Polaridad_Media Total_Referencias Referencias_Positivas Referencias_Negativas Referencias_Neutras
Aspect
temperature 1.000000 1 1 0 0
cuisine 0.833333 12 10 0 2
bathrooms 0.700000 30 23 2 5
parking 0.666667 12 7 0 5
drinks 0.666667 3 2 0 1
staff 0.653846 26 15 0 11
internet 0.636364 11 9 2 0
pool 0.574468 47 28 2 17
bedrooms 0.478261 23 12 1 10
location 0.459459 37 21 4 12
spa 0.400000 5 2 0 3
atmosphere 0.333333 36 15 3 18
building 0.313433 134 47 6 81
bar 0.296296 54 18 2 34
service 0.263158 19 10 5 4
booking 0.200000 5 1 0 4
shopping 0.200000 5 1 0 4
facilities 0.200000 15 4 1 10
breakfast 0.181818 11 2 0 9
price 0.157895 19 5 2 12
coffee 0.100000 10 2 1 7
amenities 0.000000 1 0 0 1
gym 0.000000 1 0 0 1
events 0.000000 8 0 0 8
transportation 0.000000 2 0 0 2
checking -0.333333 3 0 1 2
restrooms -1.000000 1 0 1 0
cleanliness -1.000000 1 0 1 0

We plot the results graphically, which gives us a quick, at-a-glance "average rating" of the hotel's aspects; among the most frequently mentioned aspects, the building itself, the pool, the bar and the bathrooms all score positively.

In [80]:
plot_polarity_per_aspects(resume_2)

We plot the most mentioned terms (top 5):

In [81]:
plot_pie_chart(most_commmon_terms_2)

This lets us deduce that the topics of the hotel's reviews revolve around these five aspects, which fits, since most of them could well be search tags on a hotel booking website.